17 research outputs found

    Quantile Cross-Spectral Density: A Novel and Effective Tool for Clustering Multivariate Time Series

    Open access publication funded by Universidade da Coruña/CISUG.
    [Abstract] Clustering of multivariate time series is a central problem in data mining with applications in many fields. Frequently, the clustering target is to identify groups of series generated by the same multivariate stochastic process. Most approaches to this problem either include a prior dimensionality reduction step, which may result in a loss of information, or consider dissimilarity measures based on correlations and cross-correlations that ignore the serial dependence structure. We propose a novel approach to measuring dissimilarity between multivariate time series aimed at jointly capturing both cross dependence and serial dependence. Specifically, each series is characterized by a set of matrices of estimated quantile cross-spectral densities, where each matrix corresponds to a pair of quantile levels. The dissimilarity between every pair of series is then evaluated by comparing their estimated quantile cross-spectral densities, and the pairwise dissimilarity matrix is taken as the starting point for a partitioning around medoids algorithm. Since the quantile-based cross-spectra capture dependence in quantiles of the joint distribution, the proposed metric has a high capability to discriminate between high-level dependence structures. An extensive simulation study shows that our clustering procedure outperforms a wide range of alternative methods, is robust to the noise distribution, and is computationally efficient. A real data application involving bivariate financial time series illustrates the usefulness of the proposed approach. The procedure is also applied to cluster nonstationary series from the UEA multivariate time series classification archive.
    This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by Universidade da Coruña/CISUG.
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    Machine learning for multivariate time series with the R package mlmts

    Open access publication funded by Universidade da Coruña/CISUG.
    [Abstract] Time series data are ubiquitous nowadays. Whereas most of the literature on the topic deals with univariate time series, multivariate time series have typically received much less attention. However, the development of machine learning algorithms for multivariate series has increased substantially in recent years. The R package mlmts attempts to provide a set of widely used data mining techniques for multivariate series. The package includes several functions for running clustering, classification, outlier detection and forecasting methods, among others. mlmts also incorporates a collection of multivariate time series datasets often used to test the performance of new classification algorithms. The main characteristics of the package are described and its use is illustrated through various examples. Practitioners from a wide variety of fields could benefit from the general framework provided by mlmts.
    This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia, “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by University of A Coruña/CISUG.
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    Outlier Detection for Multivariate Time Series: A Functional Data Approach

    Open access publication funded by Universidade da Coruña/CISUG.
    [Abstract] A method for detecting outlier samples in a multivariate time series dataset is proposed. It is assumed that an outlying series is one generated from a different process than those associated with the rest of the series. Each multivariate time series is described by means of an estimator of its quantile cross-spectral density, which is treated as a multivariate functional datum. An outlier score is then assigned to each series by using functional depths. A broad simulation study shows that the proposed approach is superior to the alternatives suggested in the literature and demonstrates that treating the estimates as functional data is a critical step. The procedure runs in linear time with respect to both the series length and the number of series, and in quadratic time with respect to the number of dimensions. Two applications concerning financial series and ECG signals highlight the usefulness of the technique.
    This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by Universidade da Coruña/CISUG.
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    Ordinal Time Series Analysis with the R Package otsfeatures

    [Abstract] The 21st century has witnessed a growing interest in the analysis of time series data. While most of the literature on the topic deals with real-valued time series, ordinal time series have typically received much less attention. However, the development of specific analytical tools for ordinal series has increased substantially in recent years. The R package otsfeatures attempts to provide a set of simple functions for analyzing ordinal time series. In particular, several commands for extracting well-known statistical features and carrying out inferential tasks are available to the user. The output of several functions can be used to perform traditional machine learning tasks including clustering, classification, or outlier detection. otsfeatures also incorporates two datasets of financial time series which were used in the literature for clustering purposes, as well as three synthetic databases. The main properties of the package are described and its use is illustrated through several examples. Researchers from a broad variety of disciplines could benefit from the tools provided by otsfeatures.
    This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia, “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF).
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    F4: An All-Purpose Tool for Multivariate Time Series Classification

    This article belongs to the Special Issue Data Mining for Temporal Data Analysis.
    [Abstract] We propose Fast Forest of Flexible Features (F4), a novel approach for classifying multivariate time series, which aims to discriminate between underlying generating processes. This goal has barely been addressed in the literature. F4 consists of two steps. First, a set of features based on the quantile cross-spectral density and the maximum overlap discrete wavelet transform is extracted from each series. Second, a random forest is fed with the extracted features. An extensive simulation study shows that F4 outperforms some powerful classifiers in a wide variety of situations, including stationary and nonstationary series. The proposed method is also capable of successfully discriminating between electrocardiogram (ECG) signals of healthy subjects and those of subjects with myocardial infarction. Additionally, despite lacking shape-based information, F4 attains state-of-the-art results on some datasets of the University of East Anglia (UEA) multivariate time series classification archive.
    This research has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia, “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received a discount in publication fees by Universidade da Coruña/CISUG.
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    The Bootstrap for Testing the Equality of Two Multivariate Stochastic Processes with an Application to Financial Markets

    [Abstract] The problem of testing the equality of the generating processes of two multivariate time series is addressed in this work. To this end, we construct two tests based on a distance measure between stochastic processes. The metric is defined in terms of the quantile cross-spectral densities of both processes. A proper estimate of this dissimilarity is the cornerstone of the proposed tests. Both techniques are based on the bootstrap. Specifically, extensions of the moving block bootstrap and the stationary bootstrap are used for their construction. The approaches are assessed in a broad range of scenarios under the null and the alternative hypotheses. The results show that the procedure based on the stationary bootstrap exhibits the best overall performance in terms of both size and power. The proposed techniques are used to answer the question of whether the dotcom bubble crash of the 2000s permanently impacted global market behavior.
    This research has been supported by MINECO (MTM2017-82724-R and PID2020-113578RB-100), the Xunta de Galicia (ED431C-2020-14), and “CITIC” (ED431G 2019/01).
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    Herramientas estadísticas para el tratamiento de datos de estrellas binarias de la misión astrométrica GAIA [Statistical tools for processing binary star data from the Gaia astrometric mission]

    Bachelor's thesis in Mathematics (Traballo de Fin de Grao en Matemáticas). Academic year 2018-2019.
    [Abstract] This document aims both to draw conclusions from the analysis of astronomical data and to suggest possible strategies for processing such data. One of the fields we focus on is double stars, since they are believed to constitute approximately 70% of all stars in our galaxy, the Milky Way, which is ample reason to give them special attention. Our starting point is the Gaia mission, launched by the European Space Agency in 2013, whose aim is to produce a three-dimensional map of our galaxy with unprecedented accuracy. The data used in our analyses are taken from stellar catalogues produced by this mission, as well as from earlier, already validated catalogues such as Hipparcos. One of the main goals of this work is the validation of the Gaia mission data on double stars, for which statistical tests on probability distributions are used. In addition, we try to convey how complex the problem of distance estimation in astronomy can be; a possible solution involves framing it from a Bayesian probabilistic approach.

    Ordinal time series analysis with the R package otsfeatures

    The 21st century has witnessed a growing interest in the analysis of time series data. Whereas most of the literature on the topic deals with real-valued time series, ordinal time series have typically received much less attention. However, the development of specific analytical tools for ordinal series has increased substantially in recent years. The R package otsfeatures attempts to provide a set of simple functions for analyzing ordinal time series. In particular, several commands for extracting well-known statistical features and carrying out inferential tasks are available to the user. The output of several functions can be used to perform traditional machine learning tasks including clustering, classification or outlier detection. otsfeatures also incorporates two datasets of financial time series which were used in the literature for clustering purposes, as well as three synthetic databases. The main properties of the package are described and its use is illustrated through several examples. Researchers from a broad variety of disciplines could benefit from the tools provided by otsfeatures.

    Quantile-Based Fuzzy Clustering of Multivariate Time Series in the Frequency Domain

    Open access publication funded by Universidade da Coruña/CISUG.
    [Abstract] A novel procedure to perform fuzzy clustering of multivariate time series generated from different dependence models is proposed. Different amounts of dissimilarity between the generating models, or changes in the dynamic behaviour over time, are some of the arguments justifying a fuzzy approach, where each series is associated with all the clusters with specific membership levels. Our procedure considers quantile-based cross-spectral features and consists of three stages: (i) each element is characterized by a vector of proper estimates of the quantile cross-spectral densities, (ii) principal component analysis is carried out to capture the main differences while reducing the effects of noise, and (iii) the squared Euclidean distance between the first retained principal components is used to perform clustering through the standard fuzzy C-means and fuzzy C-medoids algorithms. The performance of the proposed approach is evaluated in a broad simulation study where several types of generating processes are considered, including linear, nonlinear and dynamic conditional correlation models. Assessment is done in two different ways: by directly measuring the quality of the resulting fuzzy partition and by taking into account the ability of the technique to determine the overlapping nature of series located equidistant from well-defined clusters. The procedure is compared with the few alternatives suggested in the literature and substantially outperforms all of them, whatever the underlying process and the evaluation scheme. Two specific applications involving air quality and financial databases illustrate the usefulness of our approach.
    The authors are grateful to the anonymous referees for their comments and suggestions. The research of Ángel López-Oriona and José A. Vilar has been supported by the Ministerio de Economía y Competitividad (MINECO) grants MTM2017-82724-R and PID2020-113578RB-100, the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14), and the Centro de Investigación del Sistema Universitario de Galicia “CITIC” grant ED431G 2019/01; all of them through the European Regional Development Fund (ERDF). This work has received funding for open access charge by Universidade da Coruña/CISUG.
    Xunta de Galicia; ED431C-2020-14. Xunta de Galicia; ED431G 2019/0

    Fuzzy clustering of ordinal time series based on two novel distances with economic applications

    Time series clustering is a central machine learning task with applications in many fields. While the majority of the methods focus on real-valued time series, very few works consider series with a discrete response. In this paper, the problem of clustering ordinal time series is addressed. To this end, two novel distances between ordinal time series are introduced and used to construct fuzzy clustering procedures. Both metrics are functions of the estimated cumulative probabilities, thus automatically taking advantage of the ordering inherent to the series' range. The resulting clustering algorithms are computationally efficient and able to group series generated from similar stochastic processes, reaching accurate results even when the series come from a wide variety of models. Since the dynamics of the series may vary over time, we adopt a fuzzy approach, thus enabling the procedures to assign each series to several clusters with different membership degrees. An extensive simulation study shows that the proposed methods outperform several alternative procedures. Weighted versions of the clustering algorithms are also presented and their advantages with respect to the original methods are discussed. Two specific applications involving economic time series illustrate the usefulness of the proposed approaches.